Introduction to NLP — MSc. DH EdC-PSL
October 29, 2025
Introduction to modern NLP models and methods.
Contact: noe.durandard@psl.eu
GitHub Repository: d-noe/NLP_DH_PSL_Fall2025
Hands-On: Google Colab, Local Machines, Binder, …
- simulate behaviour of the real world
- understand which events are in better agreement with the world
- predict the next event given a description of "context" (current state)
- Same!
- But what are these *events*?
The linguistic events used in Language Models are linguistic units: texts, sentences, words, tokens, characters, …
Ambiguous Definitions?
In the context of this lecture, you can consider these atomic units, or tokens, to be words, but keep in mind that the most appropriate unit can depend on the application, and that:
Language Model
A Language Model (LM) estimates the probability of pieces of text.
Given a sequence of text \(w_{1},w_{2},\cdots,w_{S}\), it answers the question:
What is \(P(w_{1},w_{2},\cdots,w_{S})\)?
… but also the other way around: use DH to study LMs.
Recall:
Language Model
A Language Model (LM) estimates the probability of pieces of text.
How to compute: \(P(w_{1},w_{2},\cdots,w_{S})\)?
What is the most probable piece of text?
\(\to\) quite intuitive, but how are machines supposed to understand it?
Conditional Probabilities: \[P(B|A) = P(A,B)/P(A)\ \Leftrightarrow\ P(A,B)=P(A)P(B|A)\]
The Chain Rule in general: \[ \begin{align} P(x_1,x_2,x_3,\cdots x_S) =&\ P(x_1)P(x_2|x_1)P(x_3|x_1,x_2) \\ &\quad \cdots P(x_S|x_1,x_2,x_3,\cdots x_{S-1}) \\ =& \prod_{i=1}^{S} P(x_i|x_1\cdots x_{i-1}) \end{align} \]
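The chain rule can be checked with a tiny worked example. The probabilities below are made-up numbers for illustration only:

```python
# Chain rule: P(w1, w2, w3) = P(w1) * P(w2|w1) * P(w3|w1,w2).
# Illustrative (made-up) values:
p_w1 = 0.2            # P("the")
p_w2_given_w1 = 0.05  # P("cat" | "the")
p_w3_given_w12 = 0.1  # P("sat" | "the cat")

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w12
print(p_sequence)  # joint probability of the 3-word sequence
```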
State only depends on the \((n-1)\) preceding states:
\(P(x_i|x_1\cdots x_{i-1}) \approx P(x_i|x_{i-n+1}\cdots x_{i-1})\)
\(\Rightarrow P(x_1,x_2,x_3,\cdots x_S) \approx \prod_{i=1}^S P(x_i|x_{i-n+1}\cdots x_{i-1})\)
\(\to\) reduces “context” and allows simple (count-based) computation of \(n\)-gram probabilities.
The parameters of the model are the conditional probabilities \(P(x_i|x_{i-n+1}\cdots x_{i-1})\), which can be estimated on some corpus.
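These count-based estimates can be sketched in a few lines for the bigram (\(n=2\)) case, using a toy corpus (illustrative data only):

```python
from collections import Counter

# Count-based bigram model: estimate P(w_i | w_{i-1}) as
# C(w_{i-1} w_i) / C(w_{i-1}) from corpus counts.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("cat", "the"))  # C(the cat)=2, C(the)=3 -> 2/3
```

Real systems use much larger corpora and smoothing to handle unseen n-grams.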
\(P(w_{1},w_{2},\cdots,w_{S}) = \prod_i P(w_i) = P(w_1)P(w_2)\cdots P(w_S)\)
\(\to\) What is the most probable sequence?
\(P(\text{the a the a the a})\)
\(=P(\text{the})P(\text{a})P(\text{the})P(\text{a})P(\text{the})P(\text{a})\)
\(P(\text{the cat sat on the mat})\)
\(=P(\text{the})P(\text{cat})P(\text{sat})P(\text{on})P(\text{the})P(\text{mat})\)
\(P(w_{1},w_{2},\cdots,w_{S}) = \prod_i P(w_i|w_{i-1}) = P(w_1)P(w_2|w_1)\cdots P(w_S|w_{S-1})\)
\(\to\) What is the most probable sequence?
\(P(\text{the a the a the a})\)
\(=P(\text{the})P(\text{a}|\text{the})P(\text{the}|\text{a})P(\text{a}|\text{the})P(\text{the}|\text{a})P(\text{a}|\text{the})\)
\(P(\text{the cat sat on the mat})\)
\(=P(\text{the})P(\text{cat}|\text{the})P(\text{sat}|\text{cat})P(\text{on}|\text{sat})P(\text{the}|\text{on})P(\text{mat}|\text{the})\)
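Scoring a whole sequence under a bigram model follows the same decomposition. A minimal sketch on a toy corpus (illustrative data, no smoothing):

```python
from collections import Counter

# Score a sequence under a bigram model:
# P(w_1..w_S) = P(w_1) * prod_i P(w_i | w_{i-1}).
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_tokens = len(corpus)

def sequence_prob(words):
    p = unigrams[words[0]] / n_tokens               # P(w_1)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p

print(sequence_prob("the cat sat".split()))  # (2/6) * (1/2) * (1/1)
```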
The students opened their [?]
\(\to\) \(4\)-gram model: \(P(\text{?}|\text{students opened their})\)
Sample from distribution: \(\to\) The students opened their books
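Sampling the next word from an estimated distribution can be sketched as follows, here in the bigram case on a toy corpus (illustrative data only):

```python
import random
from collections import Counter

# Sample the next word proportionally to bigram counts.
corpus = "the students opened their books the students opened their minds".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def sample_next(prev, rng):
    """Draw a next word w with probability proportional to C(prev w)."""
    candidates = [(w, c) for (p, w), c in bigrams.items() if p == prev]
    words, counts = zip(*candidates)
    return rng.choices(words, weights=counts, k=1)[0]

rng = random.Random(0)
print(sample_next("their", rng))  # "books" or "minds", each seen once
```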
Why n-grams?
Note
All modern neural NLP techniques actually focus on n-grams, estimating various kinds of related probabilities.
Limitations:
100+ years of LMs in 30 seconds:
Language Modeling problem
How to compute \(P(w_{1},w_{2},\cdots,w_{S})\) or \(P(w_i|w_1\cdots w_{i-1})\)?
Count-based models: rely on explicit co-occurrence counts.
\(P(w_i|w_1\cdots w_{i-1})\approx\frac{\mathrm{C}(w_{i-n+1}\cdots w_{i-1}w_i)}{\mathrm{C}(w_{i-n+1}\cdots w_{i-1})}\)
Neural models: learn a function \(f_\Theta\) that models natural language
\[P(w_i|w_1\cdots w_{i-1})=f_{\Theta}(w_{1},w_{2},\cdots,w_{i-1})\]
\(\to\) + based on continuous vector embeddings (making semantics emerge).
\[ P(w_t|w_{t-n+1:t-1}) = \text{softmax}(g_\theta(e(w_{t-n+1}),\dots,e(w_{t-1}))) \]
✅ Captures semantic similarity
🚫 Fixed-size context → still limited, like n-grams
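The fixed-window formulation above can be sketched with NumPy. Dimensions and weights are arbitrary toy values; a real model learns them by gradient descent, and \(g_\theta\) is reduced here to a single linear map for brevity:

```python
import numpy as np

# Fixed-window neural LM sketch: embed the n-1 context words,
# concatenate, apply a linear map, softmax over the vocabulary.
rng = np.random.default_rng(0)
vocab_size, d_emb, n_context = 10, 4, 2   # toy dimensions

E = rng.normal(size=(vocab_size, d_emb))              # embedding table e(.)
W = rng.normal(size=(n_context * d_emb, vocab_size))  # g_theta (linear here)

def next_word_probs(context_ids):
    h = np.concatenate([E[i] for i in context_ids])  # e(w_{t-n+1}),...,e(w_{t-1})
    logits = h @ W
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

probs = next_word_probs([3, 7])
print(probs.sum())  # a proper distribution over the vocabulary
```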
Motivation
RNN idea:
Note
Many variants have been developed, e.g. Long Short-Term Memory (LSTM) networks to control information flow (Hochreiter and Schmidhuber 1997) (later extended to bi-LSTMs, which take context from both sides), and Gated Recurrent Units (GRUs), a simplified approach (Cho et al. 2014). Both mitigate the vanishing gradients arising in traditional RNNs and allow longer dependencies.
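The core recurrence of a vanilla RNN can be sketched in a few lines (toy dimensions, random weights; LSTMs and GRUs add gating on top of this idea):

```python
import numpy as np

# Vanilla RNN step: h_t = tanh(x_t W_x + h_{t-1} W_h + b).
rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_x = rng.normal(size=(d_in, d_hidden))
W_h = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Process a short sequence: the hidden state carries context forward.
h = np.zeros(d_hidden)
for x_t in rng.normal(size=(4, d_in)):
    h = rnn_step(h, x_t)
print(h.shape)  # one hidden state summarizing the whole sequence
```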
\(\to\) These challenges motivated attention mechanisms and the Transformer architecture.
Extracted from (Mittal 2024).
\[\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\]
where, for a given input \(X\in\mathbb{R}^{N\times d}\) and learnable projection matrices \(W^Q\in\mathbb{R}^{d\times d_k}\), \(W^K\in\mathbb{R}^{d\times d_k}\), and \(W^V\in\mathbb{R}^{d\times d_v}\):
\[Q=XW^Q,\quad K=XW^K,\quad V=XW^V\]
\(\to\) can be parallelized on GPUs!
\(\to\) \(QK^T\): attention weights from one word to another.
\(\to\) information flow scaled by attention weights.
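The formula above translates almost directly into NumPy. A minimal sketch with random toy matrices (single head, no masking):

```python
import numpy as np

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
rng = np.random.default_rng(0)
N, d, d_k, d_v = 4, 8, 6, 6
X = rng.normal(size=(N, d))
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                            # information flow scaled by weights

print(weights.sum(axis=-1))  # each token's attention weights sum to 1
print(output.shape)
```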
Think of: “The cat chased the mouse in the garden.”
Intuition: self-attention captures relationships between tokens, FFN refines token-wise representations.
Implementation: two linear layers with a ReLU in between: \(\mathrm{FFN}(x)=\max\left(0,\, xW_1+b_1\right)W_2+b_2\)
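This position-wise FFN can be sketched directly from the formula (toy dimensions and random weights; Transformers typically use a wider inner layer, e.g. \(4\times\) the model dimension):

```python
import numpy as np

# Position-wise FFN: FFN(x) = max(0, x W1 + b1) W2 + b2,
# applied independently to each token's representation.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between two linear maps

tokens = rng.normal(size=(4, d_model))  # 4 token representations
print(ffn(tokens).shape)                # shape is preserved per token
```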
More technicalities (e.g. residual connections, layer normalization): facilitate information flow and convergence.
Overview of pre-trained LM types (from (Li 2022)).
The goal is to predict the masked token: All the [MASK] best
The goal is to predict the next token: All the very …
\(\to\) We’ll come back to this later.
You can find, access (and share) open-weights LLMs on HuggingFace.
Language Modeling